#5 Basic NLP with Command-line
Faculty of Humanities and Social Sciences
University of Lucerne
31 March 2023
Example location of the course material: /home/alex/KED2023
pwd # get the path to the current directory
cd .. # go one folder up
cd FOLDERNAME # go one folder down into FOLDERNAME
ls -l # see the content of the current folder
historical development of Swiss party politics (Tagesanzeiger)
.txt)
.csv, .tsv, .xml)
Processing a collection of documents (src)
egrep -ir "computational" folder/ # search in all files in folder, ignore case
# common egrep options:
# -i search case-insensitive
# -r search recursively in all subfolders
# --colour highlight matches
# --context 2 show 2 lines above/below match
relative frequency = n_occurrences / n_total_words
.tsv file
Print the following sentence in your command line using echo.
How many words are in this sentence? Use the pipe operator to combine the command above with wc.
Match the word computational and colorize its occurrences in the sentence using egrep.
Get the frequencies of each word in this sentence using tr and other commands.
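The four steps above can be sketched as follows. This is one possible approach, not the reference solution; the sentence is a hypothetical stand-in, since the original one is not reproduced here.

```shell
# Hypothetical example sentence (an assumption, not the sentence from the slides)
sentence="Computational methods make computational text analysis fast"

# 1) Print the sentence
echo "$sentence"

# 2) Count the words by piping echo into wc; tr strips padding that some wc versions add
n_words=$(echo "$sentence" | wc -w | tr -d '[:space:]')
echo "$n_words"

# 3) Highlight every match of the word, ignoring case
echo "$sentence" | egrep -i --colour "computational"

# 4) Word frequencies: lowercase, put one word per line, sort, count duplicates,
#    then sort by count in descending order
word_freq=$(echo "$sentence" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' | sort | uniq -c | sort -rn)
echo "$word_freq"
```

The step `sort | uniq -c` only works in this order because uniq merges adjacent identical lines only.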
🤓 Published code and data are part of the endeavour of open science.
Change into your local copy of the GitHub course repository KED2023 and update it with git pull. If you haven’t cloned the repository yet, follow section 5 of the installation guide.
You find some party programmes (Grüne, SP, SVP) in materials/data/swiss_party_programmes/txt. The programmes are provided in plain text which I have extracted from the publicly available PDFs.
Have a look at the content of some of these text files using more.
Compare the absolute frequencies of single terms or multi-word expressions of your choice (e.g., Ökologie, Sicherheit, Schweiz)…
Use the file names as a filter to get various aggregations of the word counts.
Pick terms of your interest and look at their contextual use by extracting relevant passages. Does the usage differ across parties or time?
Share your insights with the class using Etherpad.
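A minimal sketch of how the counting and context tasks above might be approached. To stay self-contained it builds a tiny corpus in a temporary folder; the file names (`partyA_2019.txt` and so on) and their contents are hypothetical and do not reflect the actual files in materials/data/swiss_party_programmes/txt.

```shell
# Build a tiny hypothetical corpus (stand-in for the real txt folder)
corpus=$(mktemp -d)
printf 'Die Schweiz braucht Sicherheit.\nMehr Sicherheit in der Schweiz.\n' > "$corpus/partyA_2019.txt"
printf 'Klimaschutz in der Schweiz.\n' > "$corpus/partyB_2019.txt"

# Absolute frequency per file: -o prints every match on its own line, wc -l counts them
for f in "$corpus"/*.txt; do
  n=$(egrep -io "schweiz" "$f" | wc -l | tr -d '[:space:]')
  printf '%s\t%s\n' "$(basename "$f")" "$n"
done

# File names as a filter: aggregate over all files of one (hypothetical) party
total_a=$(cat "$corpus"/partyA_*.txt | egrep -io "schweiz" | wc -l | tr -d '[:space:]')
echo "$total_a"

# Contextual use: show each match with one line above and below
egrep -ir --context=1 "sicherheit" "$corpus"

# Remove the temporary corpus again
rm -r "$corpus"
```

Counting with `-o` and `wc -l` rather than `egrep -c` is a deliberate choice here: `-c` counts matching lines, so a line containing the term twice would be counted only once.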
tsv dataset. Compute the relative word frequency instead of the absolute frequency using any spreadsheet software (e.g. Excel). Are your conclusions still valid after accounting for the size?
Pro Tip 🤓: Use egrep to look up commands in the .md course slides.
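The task above asks for a spreadsheet, but the relative frequency (occurrences divided by the total number of words) can also be computed on the command line. The following is a sketch under assumptions: the text is a hypothetical stand-in for one party programme, and awk is used here only because the shell itself does integer arithmetic.

```shell
# Hypothetical text standing in for one party programme
text="Die Schweiz braucht Sicherheit, die Schweiz braucht Ideen"

# Absolute frequency of the term and total number of words
n_occurrences=$(echo "$text" | egrep -io "schweiz" | wc -l | tr -d '[:space:]')
n_total_words=$(echo "$text" | wc -w | tr -d '[:space:]')

# Relative frequency = n_occurrences / n_total_words, computed by awk
rel_freq=$(awk -v n="$n_occurrences" -v t="$n_total_words" 'BEGIN { printf "%.3f", n / t }')
echo "$rel_freq"
```

Dividing by the document length makes counts from a short and a long programme comparable, which is the point of the question about accounting for size.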
If you are looking for useful primers on Bash, consider the following resources: